Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions
We consider the least-squares regression problem and provide a detailed
asymptotic analysis of the performance of averaged constant-step-size
stochastic gradient descent (a.k.a. least-mean-squares). In the strongly-convex
case, we provide an asymptotic expansion up to explicit exponentially decaying
terms. Our analysis leads to new insights into stochastic approximation
algorithms: (a) it gives a tighter bound on the allowed step-size; (b) the
generalization error may be divided into a variance term which is decaying as
O(1/n), independently of the step-size γ, and a bias term that decays as
O(1/(γ²n²)); (c) when allowing non-uniform sampling, the choice of a
good sampling density depends on whether the variance or bias terms dominate.
In particular, when the variance term dominates, optimal sampling densities do
not lead to much gain, while when the bias term dominates, we can choose larger
step-sizes that lead to significant improvements.
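A minimal sketch of the averaged constant-step-size SGD (least-mean-squares) recursion analyzed above, applied to least-squares regression; the function name, step-size value and toy data are illustrative assumptions, not code from the paper:

```python
# Hedged sketch: one pass of constant-step-size SGD on the squared loss,
# returning the Polyak-Ruppert average of the iterates.
import numpy as np

def averaged_lms(X, y, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)        # current iterate
    theta_bar = np.zeros(d)    # running average of the iterates
    for t, i in enumerate(rng.permutation(n), start=1):
        grad = (X[i] @ theta - y[i]) * X[i]   # gradient of 0.5*(x_i^T theta - y_i)^2
        theta = theta - step_size * grad
        theta_bar += (theta - theta_bar) / t  # online averaging
    return theta_bar

# toy usage on synthetic least-squares data
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.arange(1, 6) + 0.1 * rng.standard_normal(1000)
print(averaged_lms(X, y, step_size=0.05))
```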
AdaBatch: Efficient Gradient Aggregation Rules for Sequential and Parallel Stochastic Gradient Methods
We study a new aggregation operator for gradients coming from a mini-batch
for stochastic gradient (SG) methods that allows a significant speed-up in the
case of sparse optimization problems. We call this method AdaBatch and it only
requires a few lines of code change compared to regular mini-batch SGD
algorithms. We provide theoretical insight into how this new class of
algorithms performs and show that it is equivalent to an implicit
per-coordinate rescaling of the gradients, similar to what Adagrad methods
do. In theory and in practice, this new aggregation allows us to keep the same
sample efficiency of SG methods while increasing the batch size.
Experimentally, we also show that in the case of smooth convex optimization,
our procedure can even obtain a better loss when increasing the batch size for
a fixed number of samples. We then apply this new algorithm to obtain a
parallelizable stochastic gradient method that is synchronous but allows
speed-up on par with Hogwild! methods, as convergence does not deteriorate as
the batch size increases. The same approach can be used to make mini-batches
provably efficient for variance-reduced SG methods such as SVRG.
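For illustration, a hedged sketch of a per-coordinate mini-batch aggregation rule in the spirit described above: each coordinate of the summed gradient is divided by the number of samples in the batch that are non-zero on that coordinate rather than by the batch size. The function name and the toy batch are illustrative assumptions:

```python
# Sketch of per-coordinate gradient aggregation for sparse mini-batches.
import numpy as np

def adabatch_aggregate(grads):
    """grads: (batch_size, d) array of per-sample gradients."""
    summed = grads.sum(axis=0)
    support = np.count_nonzero(grads, axis=0)   # samples active per coordinate
    scale = np.where(support > 0, support, 1)   # avoid dividing by zero
    return summed / scale

# compare with plain mini-batch averaging on a sparse toy batch
grads = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 4.0, 0.0]])
print(adabatch_aggregate(grads))   # [1. 3. 0.]
print(grads.mean(axis=0))          # [0.333... 2. 0.]
```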
On the Consistency of Ordinal Regression Methods
Many of the ordinal regression models that have been proposed in the
literature can be seen as methods that minimize a convex surrogate of the
zero-one, absolute, or squared loss functions. A key property that allows one to
study the statistical implications of such approximations is that of Fisher
consistency. Fisher consistency is a desirable property for surrogate loss
functions and implies that in the population setting, i.e., if the probability
distribution that generates the data were available, then optimization of the
surrogate would yield the best possible model. In this paper we will
characterize the Fisher consistency of a rich family of surrogate loss
functions used in the context of ordinal regression, including support vector
ordinal regression, ORBoosting and least absolute deviation. We will see that,
for a family of surrogate loss functions that subsumes support vector ordinal
regression and ORBoosting, consistency can be fully characterized by the
derivative of a real-valued function at zero, as happens for convex
margin-based surrogates in binary classification. We also derive excess risk
bounds for a surrogate of the absolute error that generalize existing risk
bounds for binary classification. Finally, our analysis suggests a novel
surrogate of the squared error loss. We compare this novel surrogate with
competing approaches on 9 different datasets. Our method proves highly
competitive in practice, outperforming the least squares loss on 7 out of 9
datasets.
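For concreteness, a minimal sketch of one threshold-based surrogate of the absolute error (an "all thresholds" hinge loss in the spirit of support vector ordinal regression); the function name, thresholds and toy values are illustrative assumptions, not the paper's code:

```python
# Sketch of an "all thresholds" hinge surrogate for ordinal regression with
# K classes and ordered thresholds theta_1 <= ... <= theta_{K-1}.
def all_thresholds_hinge(f, y, thetas):
    """Surrogate loss for a scalar score f and a label y in {1, ..., K}."""
    loss = 0.0
    for k, theta in enumerate(thetas, start=1):
        if k < y:
            loss += max(0.0, 1.0 - (f - theta))   # score should lie above theta_k
        else:
            loss += max(0.0, 1.0 - (theta - f))   # score should lie below theta_k
    return loss

# toy check with K = 4 classes and thresholds -1, 0, 1
print(all_thresholds_hinge(f=-1.5, y=1, thetas=[-1.0, 0.0, 1.0]))  # 0.5
print(all_thresholds_hinge(f=2.0, y=4, thetas=[-1.0, 0.0, 1.0]))   # 0.0
```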
Regularized Nonlinear Acceleration
We describe a convergence acceleration technique for unconstrained
optimization problems. Our scheme computes estimates of the optimum from a
nonlinear average of the iterates produced by any optimization method. The
weights in this average are computed via a simple linear system, whose solution
can be updated online. This acceleration scheme runs in parallel to the base
algorithm, providing improved estimates of the solution on the fly, while the
original optimization method is running. Numerical experiments are detailed on
classical classification problems.
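A hedged sketch of such an acceleration step: the weights of the nonlinear average of the last few iterates are obtained from a small regularized linear system built from consecutive differences, so the extrapolation can be formed on the side while the base method keeps running. The function name, regularization value and toy quadratic are illustrative assumptions:

```python
# Sketch of a regularized extrapolation step combining past iterates.
import numpy as np

def rna_extrapolate(iterates, lam=1e-8):
    """iterates: list of iterates x_0, ..., x_k (1-D arrays of equal length)."""
    X = np.stack(iterates)                  # shape (k+1, d)
    R = np.diff(X, axis=0)                  # residuals r_i = x_{i+1} - x_i
    k = R.shape[0]
    G = R @ R.T                             # Gram matrix of the residuals
    G = G / np.linalg.norm(G) + lam * np.eye(k)
    z = np.linalg.solve(G, np.ones(k))
    c = z / z.sum()                         # weights summing to one
    return c @ X[:-1]                       # extrapolated estimate

# toy usage: accelerate plain gradient descent on a quadratic with minimum 0
A = np.diag([1.0, 10.0])
x = np.array([5.0, 5.0])
iterates = [x.copy()]
for _ in range(6):
    x = x - 0.05 * (A @ x)                  # gradient step on 0.5 * x^T A x
    iterates.append(x.copy())
print(rna_extrapolate(iterates), "vs plain gradient descent:", iterates[-1])
```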
Learning with Clustering Structure
We study supervised learning problems using clustering constraints to impose
structure on either features or samples, seeking to help both prediction and
interpretation. The problem of clustering features arises naturally in text
classification for instance, to reduce dimensionality by grouping words
together and identifying synonyms. The sample clustering problem, on the other
hand, applies to multiclass problems where we are allowed to make multiple
predictions and the performance of the best answer is recorded. We derive a
unified optimization formulation highlighting the common structure of these
problems and produce algorithms whose core iteration complexity amounts to a
k-means clustering step, which can be approximated efficiently. We extend these
results to combine sparsity and clustering constraints, and develop a new
projection algorithm on the set of clustered sparse vectors. We prove
convergence of our algorithms on random instances, based on a union of
subspaces interpretation of the clustering structure. Finally, we test the
robustness of our methods on artificial data sets as well as real data
extracted from movie reviews.
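As an illustration of the k-means flavour of such projections, a minimal sketch that approximately projects a vector onto the set of vectors with at most k distinct entry values by clustering its coordinates; it relies on plain Lloyd iterations and illustrative names, and is not the projection algorithm from the paper:

```python
# Sketch: approximate Euclidean projection onto vectors with at most k
# distinct values, via 1-D k-means on the coordinates.
import numpy as np

def project_clustered(w, k, n_iter=50):
    centers = np.quantile(w, np.linspace(0.0, 1.0, k))   # quantile initialization
    for _ in range(n_iter):
        # assign each coordinate to its nearest center
        labels = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        # move each center to the mean of its assigned coordinates
        for j in range(k):
            if np.any(labels == j):
                centers[j] = w[labels == j].mean()
    return centers[labels]

w = np.array([0.1, 0.12, 0.9, 1.05, -2.0, -1.9])
print(project_clustered(w, k=3))   # roughly [0.11 0.11 0.975 0.975 -1.95 -1.95]
```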
Convex Relaxations for Permutation Problems
Seriation seeks to reconstruct a linear order between variables using
unsorted, pairwise similarity information. It has direct applications in
archeology and shotgun gene sequencing for example. We write seriation as an
optimization problem by proving the equivalence between the seriation and
combinatorial 2-SUM problems on similarity matrices (2-SUM is a quadratic
minimization problem over permutations). The seriation problem can be solved
exactly by a spectral algorithm in the noiseless case and we derive several
convex relaxations for 2-SUM to improve the robustness of seriation solutions
in noisy settings. These convex relaxations also allow us to impose structural
constraints on the solution, hence solve semi-supervised seriation problems. We
derive new approximation bounds for some of these relaxations and present
numerical experiments on archeological data, Markov chains and DNA assembly
from shotgun gene sequencing data.
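A minimal sketch of the spectral approach referred to above: sort the entries of the Fiedler vector (the eigenvector associated with the second smallest eigenvalue of the Laplacian of the similarity matrix) to recover an ordering. The function name and the toy similarity matrix are illustrative assumptions:

```python
# Sketch of spectral seriation: order items by the Fiedler vector of the
# Laplacian of a symmetric similarity matrix.
import numpy as np

def spectral_seriation(S):
    L = np.diag(S.sum(axis=1)) - S          # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second smallest eigenvector
    return np.argsort(fiedler)              # recovered order (up to reversal)

# toy usage: similarity decaying with distance along a hidden linear order
rng = np.random.default_rng(0)
positions = rng.permutation(6)              # hidden position of each item
S = np.exp(-np.abs(positions[:, None] - positions[None, :]).astype(float))
print(spectral_seriation(S), "vs hidden order", np.argsort(positions))
```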